Add/asyncio aiohttp acceleration #52

SuperKogito · 2021-02-14T20:55:52Z

This PR :

is related to Accelerate urls checking using concurrency (asyncio + aiohttp) #50
implements urls checks accelerations using asyncio and aiohttp libs instead of requests
adds more test files and tests for better coverage
drops support for python 3.5
attempts to fix urlchecker-action/issues/76

@vsoch I tried my best to make sure that all the features are working similarly to previous versions. However, due to the libs differences from requests, the timeout arg is now defined in minutes.
This PR will need some thorough testing due to the implemented major changes in the core of urlchecker. However, no stress so take your time 😉

codecov · 2021-02-14T21:02:21Z

Codecov Report

Merging #52 (595a5dc) into master (08fc1bb) will increase coverage by 13.98%.
The diff coverage is 94.54%.

@@             Coverage Diff             @@
##           master      #52       +/-   ##
===========================================
+ Coverage   76.76%   90.75%   +13.98%     
===========================================
  Files          12       12               
  Lines         383      400       +17     
===========================================
+ Hits          294      363       +69     
+ Misses         89       37       -52

Impacted Files	Coverage Δ
urlchecker/client/__init__.py	`78.43% <ø> (+3.92%)`	⬆️
urlchecker/logger.py	`100.00% <ø> (+57.14%)`	⬆️
urlchecker/client/check.py	`85.93% <75.00%> (+61.34%)`	⬆️
urlchecker/core/urlproc.py	`96.52% <95.91%> (+4.13%)`	⬆️
urlchecker/version.py	`100.00% <100.00%> (ø)`
urlchecker/core/check.py	`86.66% <0.00%> (+3.33%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 08fc1bb...595a5dc. Read the comment docs.

vsoch

A good start! I had some questions, and then when there is a relatively ready version I'd like to test it locally and compare the times.

vsoch · 2021-02-15T17:21:18Z

CHANGELOG.md

@@ -12,6 +12,7 @@ and **Merged pull requests**. Critical items to know are:
 Referenced versions in headers are tagged on Github, in parentheses are for pypi.

 ## [vxx](https://github.com/urlstechie/urlschecker-python/tree/master) (master)
+ - accelerate code using asyncio and aiohttp (0.0.23)


This should be a version bump,up to 0.1.0 since we would be fundamentally changing the core library and making it un-usable for older python versions.

vsoch · 2021-02-15T17:21:54Z

README.md

@@ -10,6 +10,11 @@ and then test for and report broken links. If you are interesting in using
 this as a GitHub action, see [urlchecker-action](https://github.com/urlstechie/urlchecker-action). There are also container
 bases available on [quay.io/urlstechie/urlchecker](https://quay.io/repository/urlstechie/urlchecker?tab=tags).

+## Module Dependencies
+**Versions <= 0.0.22** are built around the [Requests](https://requests.readthedocs.io/en/master/) library whereas


Suggested change

**Versions <= 0.0.22** are built around the [Requests](https://requests.readthedocs.io/en/master/) library whereas

**versions <= 0.0.22** are built around the [Requests](https://requests.readthedocs.io/en/master/) library whereas

vsoch · 2021-02-15T17:22:34Z

README.md

@@ -88,7 +93,7 @@ optional arguments:
  --save SAVE           Path to a csv file to save results to.
  --retry-count RETRY_COUNT
                        retry count upon failure (defaults to 2, one retry).
-  --timeout TIMEOUT     timeout (seconds) to provide to the requests library
+  --timeout TIMEOUT     timeout (minutes) to provide to the aiohttp library


We should not the change the API here - if the user provides seconds, we should just convert to minutes for the library.

That makes sense. I tried to keep the numbers the same, that's why I only changed the unit, but I will fix this.

vsoch · 2021-02-15T17:23:44Z

tests/test_core_fileproc.py

@@ -53,18 +53,18 @@ def test_check_file_type(file_path, file_types):
    ["tests/test_files/sample_test_file.md", "tests/test_files/sample_test_file.py"],
 )
 @pytest.mark.parametrize(
-    "white_list_patterns", [["[.py]"], ["[.md]"], ["tests/test_file"]]
+    "exclude_patterns", [["[.py]"], ["[.md]"], ["tests/test_file"]]


Suggested change

"exclude_patterns", [["[.py]"], ["[.md]"], ["tests/test_file"]]

"exclude_patterns", [["[.]py$"], ["[.]md$"], ["tests/test_file"]]

can you explain this change to me please?

Sure! So your current regular expressions are matching exactly .py. We usually only need brackets when we want an exact match of a character that could be a regular expression (in the example above, the period). So I moved the brackets around just the periods, left the letters as is, and added a $ to indicate we only want to match the end of the line (e.g., we wouldn't want to match filename.python-bindings or .mdl

tests/test_files/sample_test_file.c

vsoch · 2021-02-15T17:30:57Z

urlchecker/core/urlproc.py

+            # When we break from while, we record final response
+            self.record_response(url, response)
+
+        except Exception as e:


Let's try to not have this extra exception block. We should be aware of issues going on and try to catch them explicitly or otherwise address them.

urlchecker/core/urlproc.py

vsoch · 2021-02-15T17:32:08Z

urlchecker/core/urlproc.py

-            # When we break from while, we record final response
-            self.record_response(url, response)
+        # handle different py versions support
+        if (3, 7) <= sys.version_info:


What is the difference here?

well I tested the code using python 3.6 and above. asyncio.run() was introduced in Python 3.7 and using the other alternative caused either warnings or errors so this was the best alternative imo to keep support for 3.6 and still provide a working code.

Let's step back a second - what are the advantages of making this change? So far, I'm seeing that it introduces more potential for error because of the overlapping requests (leading to 429), adds an additional dependency (requests was just one), it hard to maintain between changing versions, and makes it impossible to use for Python < 3.5. Before you do a ton of work, we should seriously consider if this change is improving the library.

Well there are clearly many drawbacks as you mention. The clear pro for this change is the huge speed difference. I cannot think of something else for now.

urlchecker/version.py

Co-authored-by: Vanessasaurus <[email protected]>

vsoch · 2021-02-15T19:24:56Z

I don't see where this comment is but I wanted to respond:

according to aiohttp the errors can happen in the session or the get call. The reason why I went with this implementation is that it was the best one to check the urls correctly. Also the tests don't necessarily provide the best coverage of all the exception possibilities that's why I am skeptical about the alternatives. One particular error I kept getting and I am not sure how to react to it other than increasing the pause variable was [429 too many requests] and that is why I tried to diversify the urls in the test files (that way I won't overwhelm the server).

This is actually my original concern - most servers implement rate limiting, so if you do too many requests too soon, you'll get a 429 and the ip address is blocked. We would want to speed up many things with aiohttp potentially, but checking (and retrying) urls is very likely to lead to being blocked (and the serial approach with a timeout is better). So this is my main concern for wanting to update this to a newer technology, aside from not supporting under python 3.5, we actually add a feature that makes it more likely to have an error.

SuperKogito · 2021-02-15T19:38:14Z

You are right about this, that's why I asked #52 if we should add this as an argument activated feature.
Should we drop this?

SuperKogito added 6 commits February 14, 2021 21:36

add new test-files

13e774b

update/add tests for better coverage

79589f5

accelerate code using asyncio and aiohttp

b170e31

update travis config

753e799

Update readme and changelog

13b5dcd

add requirements file

263dce8

SuperKogito requested a review from vsoch February 14, 2021 20:55

format scripts using black

1477482

vsoch reviewed Feb 15, 2021

View reviewed changes

SuperKogito and others added 2 commits February 15, 2021 11:07

use a version bump instead of version increase

9196853

Co-authored-by: Vanessasaurus <[email protected]>

change indentation in .c test file

595a5dc

Co-authored-by: Vanessasaurus <[email protected]>

SuperKogito mentioned this pull request Mar 27, 2022

testing multiprocessing for faster finds! #63

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add/asyncio aiohttp acceleration #52

Add/asyncio aiohttp acceleration #52

SuperKogito commented Feb 14, 2021

codecov bot commented Feb 14, 2021 •

edited

Loading

vsoch left a comment

vsoch Feb 15, 2021

vsoch Feb 15, 2021

vsoch Feb 15, 2021

SuperKogito Feb 15, 2021

vsoch Feb 15, 2021

SuperKogito Feb 15, 2021

vsoch Feb 15, 2021

vsoch Feb 15, 2021

vsoch Feb 15, 2021

SuperKogito Feb 15, 2021

vsoch Feb 15, 2021

SuperKogito Feb 16, 2021

vsoch commented Feb 15, 2021

SuperKogito commented Feb 15, 2021

	Versions <= 0.0.22 are built around the [Requests](https://requests.readthedocs.io/en/master/) library whereas
	versions <= 0.0.22 are built around the [Requests](https://requests.readthedocs.io/en/master/) library whereas

	"exclude_patterns", [["[.py]"], ["[.md]"], ["tests/test_file"]]
	"exclude_patterns", [["[.]py$"], ["[.]md$"], ["tests/test_file"]]

Add/asyncio aiohttp acceleration #52

Are you sure you want to change the base?

Add/asyncio aiohttp acceleration #52

Conversation

SuperKogito commented Feb 14, 2021

codecov bot commented Feb 14, 2021 • edited Loading

Codecov Report

vsoch left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vsoch commented Feb 15, 2021

SuperKogito commented Feb 15, 2021

codecov bot commented Feb 14, 2021 •

edited

Loading